EuroGOV: Engineering a Multilingual Web Corpus
Abstract
EuroGOV is a multilingual web corpus that was created to serve as the document collection for WebCLEF, the CLEF 2005 web retrieval task. EuroGOV is a collection of web pages crawled from the European Union portal, European Union member state governmental web sites, and Russian government web sites. The corpus contains over 3 million documents written in more than 20 different European languages. In this paper we provide a detailed description of the EuroGOV collection.
Related resources
Web Retrieval Experiments with the EuroGOV Corpus at the University of Hildesheim
In the CLEF 2005 initiative, multilingual web retrieval was integrated as a task for the first time. This paper describes experiments based on one multilingual index carried out at the University of Hildesheim. Several indexing strategies based on a multilingual index have been tested with the EuroGOV corpus. Boosting topic fields with higher weight led to the best results during post submission ru...
Discovering Parallel Text from the World Wide Web
A parallel corpus is a rich linguistic resource for various multilingual text management tasks, including cross-lingual text retrieval, multilingual computational linguistics and multilingual text mining. Constructing a parallel corpus requires effective alignment of parallel documents. In this paper, we develop a parallel page identification system for identifying and aligning parallel documents ...
Lexical Database for Multiple Languages: Multilingual Word Semantic Network
Data mining and knowledge engineering have become difficult tasks because of the large amount of data now available on the web. The validity and reliability of data have also become a major point of debate in knowledge acquisition. Besides, acquiring knowledge from different languages has become another concern. There are many language translators and corpora developed but the function of these translators an...
Demo of iMAG Possibilities: MT-postediting, Translation Quality Evaluation, Parallel Corpus Production
An interactive Multilingual Access Gateway (iMAG) dedicated to a web site S (iMAG-S) is a good tool to make S accessible in many languages immediately and without editorial responsibility. Visitors of S as well as paid or unpaid post-editors and moderators contribute to the continuous and incremental improvement of the most important textual segments, and eventually of all. Pre-translations are...
A Multilingual Information Retrieval Tool Hierarchy for a WWW "Virtual Corpus"
The article addresses: 1. the design of an information retrieval (IR) toolkit, named the Multilingual Information Retrieval Tool Hierarchy (MIRTH) search engine, which works with virtual corpora on the World Wide Web (also known as the Web or WWW). It is motivated by the desire to create a multilingual search engine to retrieve information by accessing a virtual corpus; 2. the imple...